Describing Common Human Visual Actions in Images
Which common human actions and interactions are recognizable in monocular
still images? Which involve objects and/or other people? How many actions does
a person perform at a time? We address these questions by exploring the actions and
interactions that are detectable in the images of the MS COCO dataset. We make
two main contributions. First, a list of 140 common 'visual actions', obtained
by analyzing the largest on-line verb lexicon currently available for English
(VerbNet) and human sentences used to describe images in MS COCO. Second, a
complete set of annotations for those 'visual actions', composed of
subject-object pairs and the associated verb, which we call COCO-a (a for 'actions').
COCO-a is larger than existing action datasets in terms of number of actions
and instances of these actions, and is unique because it is data-driven, rather
than experimenter-biased. Other unique features are that it is exhaustive, and
that all subjects and objects are localized. A statistical analysis of the
accuracy of our annotations and of each action, interaction and subject-object
combination is provided.
Benchmarking and Error Diagnosis in Multi-Instance Pose Estimation
We propose a new method to analyze the impact of errors in algorithms for
multi-instance pose estimation and a principled benchmark that can be used to
compare them. We define and characterize three classes of errors (localization,
scoring, and background), study how they are influenced by instance attributes,
and measure their impact on an algorithm's performance. We apply our technique
to compare the two leading methods for human pose estimation on the COCO
dataset, and to measure the sensitivity of pose estimation with respect to
instance size, type and number of visible keypoints, clutter due to multiple
instances, and the relative score of instances. The performance of
algorithms, and the types of error they make, are highly dependent on all these
variables, but mostly on the number of keypoints and the clutter. The analysis
and software tools we propose offer a novel and insightful approach for
understanding the behavior of pose estimation algorithms and an effective
method for measuring their strengths and weaknesses.
Comment: Published at ICCV 2017. Project page available at
http://www.vision.caltech.edu/~mronchi/projects/PoseErrorDiagnosis/; code
available at https://github.com/matteorr/coco-analyze.
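
To make the benchmark's central quantity concrete, here is a minimal sketch that computes the COCO Object Keypoint Similarity (OKS) between a detection and a ground-truth person, and buckets a detection into the coarse error classes named above. The thresholds are illustrative assumptions; the paper's actual taxonomy (implemented in the coco-analyze repository linked above) is finer-grained, and scoring errors in particular cannot be read off a single OKS value, since they concern the relative ranking of detections by confidence.

    import numpy as np

    # Per-keypoint constants from the official COCO keypoint evaluation.
    SIGMAS = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72,
                       .62, .62, 1.07, 1.07, .87, .87, .89, .89]) / 10.0

    def oks(pred_xy, gt_xy, gt_vis, area):
        """Object Keypoint Similarity between one detection and one ground truth.
        pred_xy, gt_xy: (17, 2) keypoint coordinates; gt_vis: (17,) visibility
        flags; area: ground-truth segment area used as the instance scale."""
        d2 = np.sum((pred_xy - gt_xy) ** 2, axis=1)
        k2 = (2 * SIGMAS) ** 2                      # k_i = 2 * sigma_i, as in cocoeval
        e = d2 / (2 * (area + np.spacing(1)) * k2)  # normalized squared error
        vis = gt_vis > 0
        return np.exp(-e)[vis].mean() if vis.any() else 0.0

    def coarse_error_class(best_oks, lo=0.1, hi=0.85):
        # Illustrative thresholds, not the paper's calibrated values.
        if best_oks < lo:
            return "background"     # detection overlaps no annotated person
        if best_oks < hi:
            return "localization"   # matched a person, but keypoints are off
        return "correct"            # scoring errors concern ranking, not OKS alone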
Vision for Social Robots: Human Perception and Pose Estimation
In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene.
The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention.
First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person’s face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error.
Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images.
Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.
It's all Relative: Monocular 3D Human Pose Estimation from Weakly Supervised Data
We address the problem of 3D human pose estimation from 2D input images using
only weakly supervised training data. Despite showing considerable success for
2D pose estimation, the application of supervised machine learning to 3D pose
estimation in real-world images is currently hampered by the lack of varied
training images with corresponding 3D poses. Most existing 3D pose estimation
algorithms train on data that has either been collected in carefully controlled
studio settings or has been generated synthetically. Instead, we take a
different approach, and propose a 3D human pose estimation algorithm that only
requires relative estimates of depth at training time. Such a training signal,
although noisy, can be easily collected from crowd annotators, and is of
sufficient quality for enabling successful training and evaluation of 3D pose
algorithms. Our results are competitive with fully supervised regression-based
approaches on the Human3.6M dataset, despite using significantly weaker
training data. Our proposed algorithm opens the door to using existing
widespread 2D datasets for 3D pose estimation by allowing fine-tuning with
noisy relative constraints, resulting in more accurate 3D poses.
Comment: Published at BMVC 2018. Project page available at
http://www.vision.caltech.edu/~mronchi/projects/RelativePos
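
As a sketch of the kind of training signal the abstract describes: annotators label, for selected pairs of body joints, which joint is closer to the camera, and the predicted depths are trained with a pairwise ranking loss plus a squared penalty for pairs annotated as lying at the same depth. The code below follows the common ordinal-depth formulation from the literature; the exact loss and the pair-sampling scheme are assumptions, not the paper's verbatim implementation.

    import torch

    def relative_depth_loss(z, pairs, labels):
        """Pairwise ranking loss on predicted joint depths.
        z:      (J,) predicted depth for each body joint
        pairs:  (P, 2) long tensor of joint index pairs (i, j)
        labels: (P,) +1 if joint i is annotated closer than j,
                     -1 if farther, 0 if roughly at the same depth"""
        zi, zj = z[pairs[:, 0]], z[pairs[:, 1]]
        diff = zj - zi                            # positive when i is closer
        r = labels.float()
        rank = torch.log1p(torch.exp(-r * diff))  # ordinal ("closer/farther") pairs
        tie = (zi - zj) ** 2                      # "same depth" pairs
        return torch.where(labels == 0, tie, rank).mean()

A single noisy ordinal answer per pair only constrains the sign of a depth difference, which is why this form of supervision is cheap enough to collect from crowd annotators.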
A Rotation Invariant Latent Factor Model for Moveme Discovery from Static Poses
We tackle the problem of learning a rotation invariant latent factor model when the training data is comprised of lower-dimensional projections of the original feature space. The main goal is the discovery of a set of 3-D bases poses that can characterize the manifold of primitive human motions, or movemes, from a training set of 2-D projected poses obtained from still images taken at various camera angles. The proposed technique for basis discovery is data-driven rather than hand-designed. The learned representation is rotation invariant, and can reconstruct any training instance from multiple viewing angles. We apply our method to modeling human poses in sports (via the Leeds Sports Dataset), and demonstrate the effectiveness of the learned bases in a range of applications such as activity classification, inference of dynamics from a single frame, and synthetic representation of movements
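
The factorization at the heart of the model lends itself to a compact sketch: each observed 2-D pose is treated as a camera-rotated projection of a linear combination of shared 3-D basis poses, and the bases, per-instance coefficients, and per-instance rotations are fit jointly. The toy below illustrates this with plain gradient descent; the dimensions, the single-axis rotation, the optimizer, and the random stand-in data are all assumptions rather than the paper's solver.

    import torch

    J, K, N = 14, 5, 100                  # joints, basis poses (movemes), instances
    X = torch.randn(N, 2, J)              # observed 2-D poses (stand-in for real data)

    B = torch.randn(K, 3, J, requires_grad=True)   # shared 3-D basis poses
    C = torch.zeros(N, K, requires_grad=True)      # per-instance coefficients
    theta = torch.zeros(N, requires_grad=True)     # per-instance camera azimuth

    def project(theta, pose3d):
        # Rotate each 3-D pose about the vertical axis, then drop depth.
        c, s = torch.cos(theta), torch.sin(theta)
        x = c[:, None] * pose3d[:, 0] + s[:, None] * pose3d[:, 2]
        return torch.stack([x, pose3d[:, 1]], dim=1)

    opt = torch.optim.Adam([B, C, theta], lr=0.01)
    for step in range(2000):
        pose3d = torch.einsum('nk,kdj->ndj', C, B)   # combine the bases
        loss = ((project(theta, pose3d) - X) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

Because the rotation is optimized per instance, the learned bases must explain every pose up to viewpoint, which is what makes the representation rotation invariant and lets it reconstruct a training instance from multiple viewing angles.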
COCO-a
A large and well-annotated dataset of actions built on the current best image dataset for visual recognition, with rich annotations including all the actions performed by each person in the dataset, the people and objects involved in each action, the subject's posture and emotion, and high-level visual cues such as mutual position and distance.
Related Publication:
Describing Common Human Visual Actions in Images
Ronchi, Matteo Ruggero (Caltech)
Perona, Pietro (Caltech)
Proceedings of the British Machine Vision Conference (BMVC)
2015-06-07
https://doi.org/10.48550/arXiv.1506.02203
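
For orientation, a single COCO-a interaction might be serialized along the lines of the record below. The field names are hypothetical, chosen only to mirror the annotation types listed in the description above; consult the released dataset for the actual schema.

    # Hypothetical record layout; field names are illustrative, not the real schema.
    annotation = {
        "subject_id": 183020,                # COCO person instance performing the action
        "object_id": 471175,                 # COCO instance the action is directed at
        "visual_actions": ["hold", "look at"],
        "posture": "standing",               # subject's posture
        "emotion": "neutral",                # subject's emotion
        "relative_position": "in front of",  # high-level visual cue
        "distance": "contact",               # high-level visual cue
    }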
Caltech Multi-Distance Portraits (CMDP)
A dataset of frontal portraits of 53 individuals spanning a number of attributes (sex, age, race, hair), each photographed from seven distances.
Related Publication:
Distance Estimation of an Unknown Person from a Portrait
Burgos-Artizzu, Xavier
Ronchi, Matteo Ruggero
Perona, Pietro
Proceedings of the European Conference on Computer Vision (ECCV)
2014-09-02
https://doi.org/10.1007/978-3-319-10590-1_21
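
As a point of reference for the task this dataset supports, the toy baseline below estimates camera-subject distance from apparent face size under a pinhole camera model. This is not the related publication's method, which instead regresses distance from the perspective distortion of facial landmarks; the function, the focal length, and the assumed average face width are illustrative assumptions.

    def distance_from_face_width(face_width_px, focal_length_px, face_width_m=0.16):
        # Pinhole model: apparent size = f * S / d, so d = f * S / apparent size.
        # 0.16 m is an assumed average (bizygomatic) face width, for illustration.
        return focal_length_px * face_width_m / face_width_px

    # e.g. a 200-pixel-wide face imaged with a 1500-pixel focal length -> ~1.2 m
    print(distance_from_face_width(200.0, 1500.0))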
Data for "It's all Relative: Monocular 3D Human Pose Estimation from Weakly Supervised Data" (BMVC 2018)